0. Introduction. Contents of this report.

We  congratulate  the  University  of  California  on  its  centenary 
and are pleased to contribute this report in its honor. Our title
is actually a slight misnomer, since the work we shall describe
begins with the 1948 paper of Shannon [141].

Information  theory  covers  a  multitude  of  subjects  (the  cynic 
might say sins) and we would like here briefly to indicate what
this report will and will not cover. It will concern itself
entirely with what is often called probabilistic coding theory.
Algebraic  coding  theory,  which  could  properly  be  considered  a 
branch  of  information  theory,  will  not  be  included  because  it  is 
largely  outside  the  competence  of  the  authors.  Although  algebraic 
coding  theory  and  probabilistic  coding  theory  are  parallel  and 
complementary  in  one  sense,  their  spirits  and  methods  are  very 
different.  There  are  other  mathematical  disciplines  which  are 
often incorrectly lumped under information theory, principally
because they use the entropy function as a tool. It would be as
incorrect to classify them under information theory as it would be
to call any theory integration theory simply because it involves
the use of the integral as a tool. Thus we shall not discuss the
problems in ergodic theory which have been solved by using entropy
as an invariant, nor problems of packing in function spaces, nor
the entropy of stochastic processes, nor the various systems of
axiomatizing entropy.

Appended  to  this  paper  is  a  bibliography  which  is  reasonably 
complete, though not exhaustive. It is obviously impossible for
us  to  discuss  every  one  of  these  papers,  particularly  as  the 
editors  of  this  volume,  of  necessity,  have  subjected  us  to  precise 
space  limitations.  The  choice  open  to  us  was  therefore  either 
to  write  an  introductory  exposition  of  information  theory  or  a 
very  technical  paper  for  specialists.  The  first  of  these  choices 
seemed  to  us  not  to  be  in  keeping  with  the  spirit  of  this  volume, 
and the second would result in a paper which could be read only
by a small group who might have little need for reading it. We
have therefore decided to compromise between the two choices.
We shall discuss a number of basic, typical, and important subjects,
which will enable the non-specialist reader to get some of the
flavor and some understanding of the theory, without at the same
time completely boring the specialist reader. We can only hope
that this compromise will not cause us to fail on both counts.

In  order  to  avoid  invidious  comparisons  and  for  other  reasons, 
we have decided to omit actual citation of references in the text.
There  are  only  two  exceptions  to  this.  One,  a  very  minor  one,  is 
where  we  cite  two  papers  with  seemingly  contradictory  results, 
because  we  wish  to  warn  the  reader  that  they  deal  with  different 
versions  of  a  problem  discussed  below.  The  other,  the  major 
exception,  is  to  refer  freely  to  the  name  and  papers  of  C.E.  Shannon, 
whose truly brilliant work founded the theory and produced many of
its important results.
1. Discrete memoryless channels.


Let $A^* = \{1, \ldots, a\}$ and $B^* = \{1, \ldots, b\}$ be, respectively, the
input and output alphabets. The alphabet that we use in everyday
life consists of 26 Latin letters, 10 numerical symbols, various
punctuation marks, and a space between words, which is itself a
punctuation mark. The alphabets $A^*$ and $B^*$ are essentially no
different and no less general. To avoid the trivial, we assume
that both $a$ and $b$ are greater than 1.

Any sequence of n letters, or elements, from $A^*$ (respectively,
from $B^*$) is called a transmitted or sent n-sequence (respectively, a
received n-sequence). In any one discussion, n will be fixed. The
sender transmits n-sequences over a channel. When he sends such a
sequence, say $u_0$, the receiver receives a chance received n-sequence;
that is, the sequence received depends upon chance. Call the chance
received n-sequence $v(u_0)$. Its distribution depends on $u_0$ and the
channel. In fact, for mathematical purposes the channel is simply
the function

(1.1)  $P\{v(u_0) = v_0\}$,

that is, the probability that, when the n-sequence $u_0$ is sent, the
chance received sequence should be $v_0$; this function is defined for
any transmitted n-sequence $u_0$ and any received n-sequence $v_0$. When
necessary  to  avoid  confusion,  dependence  on  n  should  be  indicated. 
Usually  the  function  (1.1)  is  defined  for  every  n. 

One  of  the  simplest  and  most  important  of  all  channels  is  the 
discrete memoryless channel (dmc). It is described by means of a
channel probability function (cpf) $w(j|i)$, defined for every $i \in A^*$
and every $j \in B^*$. This can be any function for which always

$\sum_{j \in B^*} w(j|i) = 1, \quad i \in A^*.$

Different functions $w$ define different dmc's.


Let

$u_0 = (a_1, \ldots, a_n)$,  $v_0 = (b_1, \ldots, b_n)$.

Then

$P\{v(u_0) = v_0\} = \prod_{k=1}^{n} w(b_k | a_k).$


We see that $w(j|i)$ can be regarded as the probability that,
when the letter $i$ is sent, the letter $j$ is received. In that case,
the individual letters received are independently distributed.
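
The product formula for the dmc can be checked numerically. What follows is a minimal sketch, not part of the original text. It assumes the cpf is stored as a nested list `w` with `w[i][j]` holding w(j|i), and that letters are numbered from 0 rather than from 1. The function name `sequence_probability` and the binary symmetric channel are illustrative choices.

```python
from itertools import product
from math import prod


def sequence_probability(w, u0, v0):
    """Probability that the dmc with cpf w turns the sent sequence u0 into v0.

    Assumption for this sketch: w[i][j] stores w(j | i), letters are 0-indexed.
    """
    if len(u0) != len(v0):
        raise ValueError("sent and received sequences must have the same length n")
    # Letters are received independently, so the probability is a product.
    return prod(w[a][b] for a, b in zip(u0, v0))


# Illustrative channel: binary symmetric, each letter flipped with probability 0.1.
w_bsc = [[0.9, 0.1],
         [0.1, 0.9]]

p = sequence_probability(w_bsc, (0, 1, 1), (0, 1, 0))  # 0.9 * 0.9 * 0.1

# For a fixed sent sequence, the probabilities of all received sequences add to 1.
total = sum(sequence_probability(w_bsc, (0, 1, 1), v)
            for v in product((0, 1), repeat=3))
```

Summing over all received 3-sequences is a quick sanity check that the function (1.1) built from a cpf is a probability distribution for each sent sequence.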

We now define the notion of a code, which is basic in
information theory. A code $(n, N, \lambda)$, where $n$ is the length of each
word, $N$ is the length of the code, and $\lambda$ is the maximum probability
of error, is a system

(1.2)  $\{(u_1, A_1), \ldots, (u_N, A_N)\}$,

where $u_1, \ldots, u_N$ are transmitted n-sequences, $A_1, \ldots, A_N$ are disjoint
sets of received n-sequences, and

(1.3)  $P\{v(u_i) \in A_i\} \ge 1 - \lambda, \quad i = 1, \ldots, N.$

A code is used as follows: When the sender wishes to send
the ith message, he sends $u_i$. When the message received lies in $A_j$,
the receiver concludes that the jth message was sent. If the
message received does not lie in any $A_j$, he may draw any conclusion
he wishes about the message that has been sent. The probability
that any message sent will be correctly understood by the receiver
is at least $1 - \lambda$.
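
To make the definition concrete, here is a small sketch that builds a code in the sense of (1.2) and computes the smallest λ for which (1.3) holds. It is an illustration under stated assumptions, not a construction from the text. The channel is a binary symmetric dmc, the two words are 000 and 111, and the decoding sets are chosen by majority. The helper names `seq_prob`, `max_error` and `decode` are hypothetical.

```python
from itertools import product
from math import prod


def seq_prob(w, u, v):
    """P{v(u) = v} for a dmc; w[i][j] stores w(j | i) (0-indexed letters)."""
    return prod(w[a][b] for a, b in zip(u, v))


def max_error(w, code):
    """Smallest lambda for which `code` satisfies (1.3).

    `code` is a list of pairs (u_i, A_i); each A_i is a set of received n-sequences.
    """
    seen = set()
    for _, A in code:
        if seen & A:
            raise ValueError("decoding sets must be disjoint")
        seen |= A
    return max(1.0 - sum(seq_prob(w, u, v) for v in A) for u, A in code)


def decode(code, v):
    """Index j of the decoding set containing v, or None if v lies in no A_j."""
    for j, (_, A) in enumerate(code):
        if v in A:
            return j
    return None


w_bsc = [[0.9, 0.1], [0.1, 0.9]]
received = list(product((0, 1), repeat=3))
code = [
    ((0, 0, 0), {v for v in received if sum(v) <= 1}),  # mostly zeros
    ((1, 1, 1), {v for v in received if sum(v) >= 2}),  # mostly ones
]
lam = max_error(w_bsc, code)  # 3 * 0.1**2 * 0.9 + 0.1**3 = 0.028
```

In the notation of the text this is a code (3, 2, λ) for every λ ≥ 0.028. `decode` returns `None` for a received sequence outside every decoding set, which corresponds to the case where the receiver may draw any conclusion he wishes.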

One  general  problem  is  this:  For  various  channels  of  interest, 
given n and $\lambda$, $0 < \lambda < 1$, how big can N be? Most of the known
results  are  asymptotic  in  n. 

The  closely  related  problem  of  constructing  the  codes  whose 
existence  is  guaranteed  by  the  theorems  that  will  be  cited  below 
is  as  yet  only  partially  solved. 

Any  vector  with  nonnegative  components  that  add  to  1  may  be 
called a probability distribution. A probability distribution on $A^*$
(respectively on $B^*$) will have $a$ (respectively $b$) components.

The entropy of a probability distribution $\pi$,

$\pi = (\pi_1, \ldots, \pi_c)$,

is defined to be

(1.4)  $H(\pi) = -\sum_{i=1}^{c} \pi_i \log_2 \pi_i.$

Logarithms to the base 2 are used for historical reasons only, and
any other base would do as well. If $\pi_i = 0$, the ith term of the
right-hand member of (1.4) is defined to be 0. This last convention
always applies. The entropy function has many important combinatorial
properties which are essential in the statement and proofs
of most coding theorems.
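
Definition (1.4), including the convention for zero components, translates directly into code. The following is a minimal sketch; the function name `entropy` and the tolerance used to check that the components add to 1 are assumptions of this example.

```python
from math import log2


def entropy(pi):
    """H(pi) as in (1.4), in bits; terms with pi_i = 0 contribute 0 by convention."""
    if any(p < 0 for p in pi) or abs(sum(pi) - 1.0) > 1e-9:
        raise ValueError("pi must have nonnegative components that add to 1")
    return -sum(p * log2(p) for p in pi if p > 0)


h_fair = entropy([0.5, 0.5])       # one fair binary choice
h_sure = entropy([1.0, 0.0])       # no uncertainty; uses the zero convention
h_four = entropy([0.25] * 4)       # uniform on four letters
```

Skipping the zero components in the sum is exactly the convention stated after (1.4), and it also avoids evaluating the logarithm at 0.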

Let $N(n, \lambda)$ be the length of the longest code (i.e., of
maximum length) of word length n and maximum probability of error
$\lambda$. Obviously $N(n, \lambda)$ is a monotonically non-decreasing function
of $\lambda$. Yet the following remarkable theorem holds:

(1.5)  $\lim_{n \to \infty} \frac{1}{n} \log N(n, \lambda) = C,$

where $C$ is a constant, independent of $\lambda$, given by

(1.6)  $\max_{\pi} \left[ H(\pi') - \sum_{i} \pi_i H(w(\cdot|i)) \right],$

where

(1.7)  $\pi' = W\pi$,

$W$ is the matrix with $w(j|i)$ the element in the jth row and ith
column, and $\pi$ and $\pi'$ are probability distributions (column
vectors) on $A^*$ and $B^*$, respectively. The number $C$ is called the
capacity of the channel. One can say even more. There exists
a positive function $K(\lambda)$ of $\lambda$ such that, for any n, there exists
a code such that

(1.8)  $N > \exp_2 \{ nC - \sqrt{n}\, K(\lambda) \}$

and there does not exist a code such that

(1.9)  $N > \exp_2 \{ nC + \sqrt{n}\, K(\lambda) \}.$

(1.8) is called a coding theorem and (1.9) is called its
strong converse. The weaker result, that always

(1.10)  $N(n, \lambda) < \exp_2 \left\{ \frac{nC + 1}{1 - \lambda} \right\},$

is called a weak converse.
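
The capacity formula (1.6) can be evaluated numerically for small alphabets. The sketch below does this for a two-letter input alphabet by a plain grid search over π. The grid search is an assumption of this example and is not a method described in the text. The cpf is stored with `w[i]` holding the distribution w(·|i), which is the transpose of the matrix W in (1.7). The names `information` and `capacity_binary_input` are hypothetical.

```python
from math import log2


def entropy(pi):
    """H(pi) as in (1.4); zero components contribute 0."""
    return -sum(p * log2(p) for p in pi if p > 0)


def information(pi, w):
    """The bracket in (1.6): H(pi') - sum_i pi_i H(w(.|i)), with pi' as in (1.7).

    Storage assumption: w[i][j] = w(j | i), so w[i] is the row w(.|i).
    """
    b = len(w[0])
    pi_out = [sum(pi[i] * w[i][j] for i in range(len(pi))) for j in range(b)]
    return entropy(pi_out) - sum(pi[i] * entropy(w[i]) for i in range(len(pi)))


def capacity_binary_input(w, steps=1000):
    """Approximate the maximum in (1.6) over pi = (t, 1 - t) on a uniform grid."""
    return max(information((k / steps, 1 - k / steps), w) for k in range(steps + 1))


# Illustrative channel: binary symmetric with crossover probability 0.1.
w_bsc = [[0.9, 0.1], [0.1, 0.9]]
C = capacity_binary_input(w_bsc)
```

For the binary symmetric channel the maximum is attained at the uniform distribution, so the grid value agrees with 1 - H((0.1, 0.9)) up to rounding. For larger input alphabets a grid becomes impractical and an iterative optimizer would be the natural replacement.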


A channel other than the dmc has a different function (1.1)
(not given by the product of the values of $w(\cdot|\cdot)$) and may have
different alphabets. Whenever, for such a channel, (1.5) is
satisfied, we shall call $C$ the capacity of the channel. Contrary
to popular belief, not all channels have a capacity. Most
"reasonable" channels of interest do.

There  are  different  and  very  interesting  methods  of  proof  of 
(1.8) and (1.9), but lack of space prevents our doing more than
barely mentioning them. One method of proving (1.8) is based on
the fact that if a code is chosen at random (!), by a reasonable
and easily specified random process, the average error (of
decoding) is very small. (This proves the existence of at least
one code, and in a sense implies that "most" codes have small
probability of error.) In a second method of proving (1.8) the
code is built up seriatim and arbitrarily until its prolongation
is impossible; the code is then shown to have the desired length.
(This again suggests that "most" codes are "good".) A third
method  involves  a  method  of  actually  counting  sequences.  This 
last  method  can  also  be  used  to  prove  the  strong  converse.  Another 
method  of  proving  the  strong  converse  essentially  replaces 
counting  sequences  by  measuring  their  volume.  Finally  the  weak 
converse  is  proved  by  a  simple  and  ingenious  manipulation  of 
entropies.  Modifications  and  combinations  of  these  methods  are 
usually  adapted  to  other  channels.  The  proofs  show  up  the 
combinatorial  significance  of  the  various  entropies  which  occur 
in  the  statements  and  proofs  of  the  theorems. 


Consider now the dmc with the following difference:
Suppose the sender can look over the receiver's shoulder and see
what the latter is receiving. The sender can choose subsequent
letters to be sent in order to correct previous reception, but he
can communicate with the receiver only over the channel. The
capacity of this channel is the same as if there were no feedback.
This channel could occur if an earthly expedition landed on the
moon.  Naturally  the  power  of  the  latter's  transmission  apparatus 
could  not  be  great.  The  receiving  station  on  earth,  however, 
would  have  almost  limitless  power  and  could  report  with  essentially 
perfect  feedback  the  message  actually  received. 

The  term  discrete  is  of  engineering  origin  and  really  means 
finite.  Channels  which  are  not  discrete  have  infinite  input 
or  output  alphabets  or  both.  The  infinite  alphabets  may  be 
countable  or  not.  The  usual  method  of  treating  such  channels  is 
to  approximate  their  alphabets  by  finite  alphabets.  This  is  not 
always  possible  and  often  difficulties  are  encountered.  When  the 
alphabets are not denumerable, measure-theoretic questions also arise.

Some  references  for  this  section: 

[ X‘<3, [21], [if], [28], [29], [40], [41], [45], [49],
[54], [56], [61], [64], [83], [85], Os], [91], [94],
[96], [98], [99], [101], [107], [112], [124], [128],
[129], [132], [141], [143], [144], [160], [161], [162],
[170], [173], [174], [175], [184].
2. Compound channels.

Consider now a dmc with this difference: Instead of a single
cpf $w$ there is given a set $S^*$ of cpf's, say $S^* = \{w(\cdot|\cdot|s),\ s \in S\}$.
Here the third index, $s$, distinguishes the cpf. The set $S^*$ may have
infinitely many elements. For each $s$, $w(j|i|s)$ is a cpf defined
for $i = 1, \ldots, a$ and $j = 1, \ldots, b$. The compound channel transmits
as follows: Each word of n letters (n-sequence) is transmitted
according to some cpf in $S^*$; the cpf may vary arbitrarily in $S^*$ from
one such word to another.

Let $P_s$ now denote probability according to the cpf $w(\cdot|\cdot|s)$.
A code $(n, N, \lambda)$ for the compound channel is a system (1.2) with
all the requirements, except that (1.3) is replaced by the stronger
requirement

(2.1)  $P_s\{v(u_i) \in A_i\} \ge 1 - \lambda, \quad i = 1, \ldots, N;\ s \in S.$

Thus,  even  if  Maxwell's  demon  tried  maliciously  to  vary  the  cpf 
so  as  to  make  things  as  difficult  as  possible,  the  probability  that 
any word sent would be incorrectly understood by the receiver is $\le \lambda$.

The question is, how long can codes be and still meet this
stronger requirement (2.1)? It must be borne in mind that the cpf's
in $S^*$ may be very "antithetical" to each other. The fact is that
theorems exactly like those for the dmc hold for the compound channel.
Thus the maximum length of the code depends on a constant called (as
in the case of dmc) the capacity ($C_1$, say) of the compound channel.

If $C_1$ were 0 in most cases, little could be done with a compound
channel. Let $C(s)$ be the capacity of the dmc with the
single cpf $w(\cdot|\cdot|s)$. Define

$C_2 = \inf_{s \in S} C(s) = \inf_{s \in S} \max_{\pi} \left\{ H(\pi'|s) - \sum_{i=1}^{a} \pi_i H[w(\cdot|i|s)] \right\}.$

Then obviously we have $C_1 \le C_2$, for the demon could use the "worst"
cpf for every word, that is, the one with the smallest capacity.

(If $S$ is an infinite set and there is no worst cpf, one uses a
cpf with a capacity arbitrarily close to the infimum.) The fact
is that

$C_1 = \max_{\pi} \inf_{s \in S} \left\{ H(\pi'|s) - \sum_{i=1}^{a} \pi_i H[w(\cdot|i|s)] \right\},$

and, surprisingly and pleasantly, $C_1$ is not 0 unless $C_2$ is 0.
Thus $C_1$ is not 0 unless $S^*$ contains a cpf whose capacity is 0 (or
a sequence whose capacities approach 0).
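
The difference between max inf and inf max can be seen on a small example. The sketch below is an illustration under stated assumptions, not a computation from the text. The set of cpf's is finite (two "antithetical" binary channels, each noiseless on one input letter and useless on the other), the input alphabet has two letters, and the maximum over π is approximated on a uniform grid. The names `information` and `compound_capacities` are hypothetical.

```python
from math import log2


def entropy(pi):
    """H(pi) as in (1.4); zero components contribute 0."""
    return -sum(p * log2(p) for p in pi if p > 0)


def information(pi, w):
    """H(pi'|s) - sum_i pi_i H[w(.|i|s)] for one cpf w, with w[i][j] = w(j | i | s)."""
    b = len(w[0])
    pi_out = [sum(pi[i] * w[i][j] for i in range(len(pi))) for j in range(b)]
    return entropy(pi_out) - sum(pi[i] * entropy(w[i]) for i in range(len(pi)))


def compound_capacities(cpfs, steps=1000):
    """Grid approximations of C_1 (max inf) and C_2 (inf max) for a finite set of cpf's."""
    grid = [(k / steps, 1 - k / steps) for k in range(steps + 1)]
    c1 = max(min(information(pi, w) for w in cpfs) for pi in grid)
    c2 = min(max(information(pi, w) for pi in grid) for w in cpfs)
    return c1, c2


# Two illustrative cpf's that favor opposite input letters.
cpfs = [
    [[1.0, 0.0], [0.5, 0.5]],  # letter 0 is received perfectly, letter 1 is scrambled
    [[0.5, 0.5], [0.0, 1.0]],  # letter 1 is received perfectly, letter 0 is scrambled
]
C1, C2 = compound_capacities(cpfs)
```

Each cpf alone is best used with a different input distribution, so a single π that must serve both gives a strictly smaller value. In this example C_1 is attained at the uniform distribution and is positive but smaller than C_2.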

Consider now a compound channel as above except that the
receiver now knows which cpf is being used (but the sender does not).
It has been shown that the capacity of this channel is also $C_1$.
Thus knowledge of the cpf by the receiver alone does not increase
the capacity.

Consider the compound channel as above, except that the cpf is
now known to the sender but not to the receiver. The capacity is
then $C_2$, which in general is greater than $C_1$.

In  all  of  the  above  results  S*  may  contain  infinitely  many 
elements,  indeed,  non-denumerably  infinitely  many  elements,  and 
the  cpf  is  chosen  arbitrarily  at  the  beginning  of  transmission  of 
each  word  by  the  "jammer".  Nevertheless,  the  fact  that  the  same 
cpf (although arbitrarily chosen) is used for every letter of the
word has essentially the effect that, to a satisfactory approximation,
the set $S$ can be replaced by a finite set or at least one with
$2^{c\sqrt{n}}$ cpf's, where $c$ is suitably chosen. This is always an essential
step of the proof. Suppose now that the cpf varies arbitrarily
from letter to letter of a word. The above reduction is now no longer
possible, previously used methods no longer apply, and the problem
becomes very difficult. Partial results have been proved in [84]
and a complete solution announced without proof in [34]. Since a
theorem announced in [34] is incompatible with a result proved in
[84], it is clear that the channels treated are not the same.* While
awaiting publication of the results announced in [34], one
can repeat, without fear of contradiction, that the problems involved
in these "arbitrarily varying channels" are very difficult.

Suppose that the cpf varies arbitrarily from letter to letter,
but with some limitations. For example, suppose that the number of
changes from one cpf to another is not greater than a fixed multiple
of $n^{\alpha}$, $\alpha < 1$. In the latter case it is easy to prove that the
problem can be reduced to (and hence solved as) the (compound) case
where the same cpf governs the transmission of each letter.

We spoke of the above problems as if neither the sender nor
the receiver knew the cpf. Of course the problem of arbitrarily
varying channels has been studied where either or both know the cpf
for each letter. In fact, the capacity of the channel where both
know the arbitrarily varying cpf is the smallest of the capacities
of the cpf's in the set $S^*$.

*Randomized codes are used in [34] but are not admitted in [84].


Perhaps  this  is  the  time  briefly  to  mention  the  question  of 
randomization.  Conceivably  the  sender  could  use  randomized 
encoding,  i.e.,  each  sender's  message  could  be  represented  by  a 
probability  distribution  over  sequences  in  the  input  alphabet. 
After  the  sender  has  decided  on  the  message  he  performs  a  chance 
experiment  with  the  corresponding  probability  distribution  and 
actually  sends  the  resulting  sequence.  Randomized  decoding  is 
defined similarly in an obvious way. Randomized codes have been
studied  to  some  extent,  and  further  studies  are  in  process. 
Generally  speaking,  randomized  decoding  seems  to  provide  little 
advantage,  but  under  certain  conditions  randomized  encoding 
actually  helps  by  either  making  a  longer  code  possible  or  by 
reducing  the  error.  Indeed,  the  author  of  [34]  states  that  his 
general  results  are  valid  only  under  randomized  encoding.  These 
results are for arbitrarily varying channels, and an explanation of
the  need  for  randomized  encoding  may  perhaps  be  the  following. 
Suppose  that  there  is  a  rational  malevolent  being,  the  "jammer" 
say,  who  chooses  the  arbitrarily  varying  cpf's  so  as  to  make 
communication between sender and receiver as difficult as possible.
The utility of randomized encoding seems to be to protect the
sender against the jammer. Even when the jammer knows the message
to  be  sent,  if  he  doesn't  know  the  actual  sequence  which  will 
represent  it  he  may  not  be  able  to  choose  the  sequence  of  cpf's 
which  will  most  efficiently  jam  it.  No  such  utility  accrues  to 
randomized decoding, and the receiver can do best by voting for the
message  with  the  highest  probability.  (This  is  not  strictly 
correct in the present model. The messages to be sent are chosen
arbitrarily and one cannot speak of the a posteriori probability
of a message after the resulting chance sequence has been received.
However, intuitively this is near enough, and it will be made precise
in the next paragraph but one.) This discussion also points
up the difference between two channels, each with arbitrarily varying
cpf's (from letter to letter). In one channel the jammer knows
the actual sequence being sent before it is sent, in the other he
knows only the probability distribution over input sequences.
Naturally the capacity of the second channel is not less than the
capacity of the first channel. The information theory literature
is sometimes not entirely mathematically precise and such distinctions
are often made only implicitly. It is likely that [34]
treats the second channel; [84] certainly treats the first.

The preceding remarks suggest that some problems in information
theory should be viewed as zero-sum two-person games between the
sender (and receiver) and the jammer. Indeed, the form of the
capacities in the several forms of the compound (stationary)
channel, i.e., max inf and inf max of the "information function" in
the above definitions of $C_1$ and $C_2$, suggests a game-theoretic background
to the problem. A number of writers have made more or less
positive  assertions  about  this,  but  no  specific  proof  of  any  coding 
theorem  or  any  other  important  fact  by  game-theoretic  methods  is 
available  in  the  literature.  If  there  is  an  essential,  non-trivial, 
and  meaningful  connection  between  the  two  theories  it  would  be  very 
interesting  and  useful  to  establish  it  precisely;  it  might  well 
lead to further results in coding theory. In several papers correlated
encoding and decoding has been used. Here the sender, before
transmitting any message, chooses a code at random, communicates
the result of his random experiment to the receiver, and then sends
the message according to the code selected. This procedure is
repeated at each message. It seems to the writers that this procedure
cannot seriously be considered as reflecting anything remotely
resembling actual communication. Surely it is vastly more complicated
for the sender to transmit to the receiver the designation of the code
which is the outcome of the chance experiment than it is to transmit
the message itself. Yet a new code must be transmitted with each
message! No doubt, however, problems involving correlated encoding
and decoding have mathematical interest.

In  most  papers  in  information  theory,  especially  those  written 
by  engineers,  it  is  assumed  that  the  message  to  be  sent  is  itself 
chosen at random (usually with equal probability for each message).
When this is so one can speak of the a posteriori probability of
a message, after it has been passed through the channel and the
chance sequence received. Naturally the average error is then
minimized if the decoder decides that the message sent is the one
with the largest a posteriori probability; this is called maximum
likelihood decoding. Maximum likelihood decoding is simple and
unambiguous only if there is only one cpf for the channel. If
there is more than one cpf then different messages can have
maximum a posteriori probability according to different cpf's and
often a difficult theory is needed for a decision. Returning to
the first and basic case studied, that of a dmc with a single cpf,
the fact is that the two cases, that of average error with the
message chosen by a (known or unknown) random mechanism, and that
of maximum error with messages chosen arbitrarily, are essentially
the same. This is also true of many other channels. This fact
contradicts the statement, made by some very great mathematicians
and widely believed in engineering circles, that no theory is
possible  without  knowing  the  statistics  of  the  (randomly  chosen) 
message. For example, the theory developed in Chapters 3 and 4
of  |15l  neither  assumes  the  existence  of  a  random  mechanism  for 
choosing messages nor makes any use of it.
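
The decoding rule described above, choosing the message with the largest a posteriori probability when messages are drawn at random, can be sketched for a dmc with a single cpf. This is a minimal illustration under stated assumptions: the prior over messages is known, `w[i][j]` stores w(j|i) with 0-indexed letters, and the names `seq_prob` and `map_decode` are hypothetical.

```python
from math import prod


def seq_prob(w, u, v):
    """P{v(u) = v} for a dmc; w[i][j] stores w(j | i)."""
    return prod(w[a][b] for a, b in zip(u, v))


def map_decode(w, codewords, prior, v):
    """Index of the message with the largest a posteriori probability given v.

    The posterior of message i is proportional to prior[i] * P{v(u_i) = v};
    the common normalizing factor does not change which message is largest.
    Ties are resolved in favor of the smallest index.
    """
    scores = [q * seq_prob(w, u, v) for u, q in zip(codewords, prior)]
    return max(range(len(scores)), key=scores.__getitem__)


w_bsc = [[0.9, 0.1], [0.1, 0.9]]
codewords = [(0, 0, 0), (1, 1, 1)]

equal = map_decode(w_bsc, codewords, [0.5, 0.5], (0, 0, 1))     # likelihood decides
skewed = map_decode(w_bsc, codewords, [0.01, 0.99], (0, 0, 1))  # prior outweighs it
```

With equal probabilities the rule reduces to comparing the channel probabilities of the received sequence. With a strongly skewed prior the same received sequence is decoded differently, which is why the rule depends on the random mechanism that chooses the messages.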

A  number  of  writers  have  stated,  mostly  without  proof,  that 
there  is  a  basic  and  meaningful  connection  between  information 
theory  and  the  theory  of  statistical  inference,  and  some  of  them 
have  attempted  formally  to  set  up  a  theory  which  would  exhibit  this 
putative  connection.  It  seems  to  us  that  to  establish  a  basic 
and  meaningful  connection  between  two  theories  requires  either 
that  one  obtain  a  common  framework  from  which  one  can  derive 
some  of  the  basic  theorems  in  both  theories,  or  that  one  derive 
important old or new theorems in one theory by use of theorems
or methods of the other. By this essential standard no meaningful
connection  between  information  theory  and  the  theory  of  statistical 
inference  has  yet  been  established.  Of  course  this  does  not  prove 
that  no  such  connection  exists. 

Some  references  for  this  section: 

[3], 0}, fuj, [16], [17], [18], [19], [20], [23], [sS],
[29], O'], [35], [37], [47], [69], [70], [73], [76],
[80], [81], [ss], [84], [92], [95], [102], [104], [105],
[123], [145], [154], [156], [157], [171], [176], [177],
[178], [180], [182], [189], [194], [195], [198].
3. Error bounds. Sequential decoding.

Suppose given a dmc with capacity $C$. Let $0 < R < C$ and consider
all codes of word length n and code length $2^{nR}$ for this
channel. (In such a case one is said to be transmitting at "rate"
$R$.) It is not difficult to show that there exist two positive
numbers, say $D_1$ and $D_2$, such that, among these codes, there exists
one for which $\lambda$, the maximum error of decoding any word, satisfies
$\lambda < D_1 \exp\{-n D_2\}$; this is summarized by saying that the error
decays exponentially (with n). It is true for most channels, not
only  the  dmc,  and  probably  all  channels  of  practical  importance, 
that  the  error  decays  exponentially  with  n.  The  proof  of  this  is 
usually quite simple and requires only an almost trivial modification
of the proof of the coding theorem. An intuitive explanation
is perhaps this: The codes we are considering are of such small
length (approximately $2^{-n(C-R)}$ of the length they could have for
a fixed $\lambda$) that there are great gaps among the different $u_i$'s,
and they can be distinguished (decoded) by the decoder with very
great accuracy. Exponential decay of error is essentially due to
the fact that the probability that the mean $n^{-1} \sum_{i=1}^{n} X_i$ of independent
identically distributed chance variables $X_1, X_2, \ldots, X_n$ shall
exceed any fixed number larger than their common expected
value decreases exponentially with n.
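
The exponential decrease mentioned at the end of the paragraph can be observed exactly for the simplest chance variables. The sketch below assumes the X_i take the values 0 and 1 with P{X_i = 1} = p, and computes the probability that their mean exceeds a threshold t > p from the binomial distribution. The names `tail_probability`, `tails` and `rates` are illustrative, and the threshold is kept as an exact fraction to avoid rounding at the cut-off.

```python
from fractions import Fraction
from math import comb, floor, log2


def tail_probability(n, p, t):
    """P{(X_1 + ... + X_n) / n > t} for independent X_i with P{X_i = 1} = p."""
    k_min = floor(n * t) + 1  # smallest number of ones whose mean strictly exceeds t
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


p = 0.1
t = Fraction(1, 5)  # exact threshold, larger than the expected value p
tails = {n: tail_probability(n, p, t) for n in (20, 40, 80, 160)}

# Exponent per letter, in bits: -(1/n) log2 of the tail probability.
rates = {n: -log2(tails[n]) / n for n in tails}
```

The tail probabilities shrink as n doubles, while the per-letter exponents in `rates` stay above a positive constant. This is the behavior that produces a bound of the form $D_1 \exp\{-n D_2\}$.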


The school of electrical engineers working in information
theory, whose intellectual center is the Massachusetts Institute
of Technology, regards the determination of the best (i.e., largest
possible) $D_2$ as one of the principal and most important problems
of information theory. Determination of bounds on $D_1$ is
considered of negligible importance. The reason given for the
importance of the problem is that the complexity of the apparatus
for coding and decoding goes up, roughly speaking, exponentially
with n, so that it is important to know the smallest n for which
one can achieve a desired rate $R$ and a desired (usually small)
upper bound on the error. Even for the dmc the problem is of formidable
difficulty. Previous attempts consisted of using a randomized
coding theorem to get a lower bound on $D_2$ and sphere packing methods*
to get an upper bound. It was thought that these two bounds agreed
over a certain range of $R$, so that $D_2$ was determined for this
range, but errors were found in the arguments. A recent new effort
has succeeded in determining $D_2$ for part of the range of $R$. The
argument is difficult and does not seem to lend itself to intuitive
description or summary. At least that part of it which gives a lower
bound on $D_2$ can be carried over with little change to many other
channels. The value of $D_2$ for all $R$ is as yet unknown, although
approximations are available.

We now turn to another subject of major investigation among
engineers, sequential decoding. This is one of the most beautiful
of all ideas in coding theory and one of the most important for
practical application. Unfortunately for the mathematician, it
does not seem to lend itself to elegant mathematical theorems.
Even an approximate description of the method would require essentially
the reproduction of at least a short paper on the
subject or the reproduction of the appropriate
chapter of a book. This is impossible for us, but perhaps
the following lines will help to form some idea.

*For a description of these methods see, e.g., p. 227.

The  actual  application  of  the  codes  hitherto  discussed 
would  always  occur  in  connection  with  a  computer.  The  code 
would  be  stored  in  the  computer  and  the  latter  would  be 
indispensable  in  both  encoding  and  decoding;  the  latter  process 
requires  many  more  computations  than  the  first.  The  volume  of 
material to be stored and the number of operations to be
performed  increase  exponentially  with  n  and  soon  exceed  the 
capacity  of  even  very  large  modern  computers.  This  raises  the 
problem  of  finding  methods  which  can  be  carried  out  practically. 
Sequential  decoding  is  intended  to  be  such  a  method. 

We pause for a moment for an intuitive discussion of what
makes for efficient coding. If the transmission of any one letter
is repeated a sufficient number of times then, except in certain
obvious special cases, the decoder (receiver) can identify the
letter being sent with a probability as close to one as desired.
In this way any desired message could be transmitted with any
desired degree of accuracy. The trouble with this naive method
is that it is grossly inefficient; in terms of our previous
parameters, for given λ and R an enormous n is generally required.
What makes for efficient decoding are the differences between
entire words rather than between individual letters; the letters
of a word reinforce each other, so to speak, so that even if
several letters are misunderstood the entire pattern still
remains clear. This is called "redundancy" as distinct from simple
repetition. For a homely example, consider the problem of reading




every letter of a manuscript written in poor handwriting. If
the  reader  is  familiar  with  the  subject  or  even  the  language 
he  can  often  reconstruct  illegible  letters  or  words  from  the 
context.  This  is  impossible  if  what  is  written  is  made  up  of 
nonsense  syllables  or  material  in  a  completely  unknown  language. 
(Although  in  the  latter  case  one  can  start  looking  for  patterns 
(i.e.,  redundant  elements)  as  crypto-analysts  do.)  The  idea  of 
sequential decoding is to introduce redundancy into the decoding
of  individual  letters,  while  avoiding  the  construction  of  codes 
which  require  the  storage  of,  and  calculation  with,  exponentially 
many  sequences.  We  should  emphasize,  however,  that  there  is 
not  just  one  method  of  sequential  decoding,  but  a  number  of 
variations.  We  shall  describe  a  typical,  but  by  no  means  unique, 
method. 
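
To make the inefficiency of simple repetition concrete, here is a minimal sketch. It assumes a binary symmetric channel with crossover probability p and a majority-vote decoder; both are assumptions chosen for illustration, since the remark above is not restricted to that channel. The function name `repetition_error_probability` is hypothetical. The sketch computes the exact error probability of an n-fold repetition and prints it beside the rate 1/n, which shrinks toward zero as the error is driven down.

```python
from math import comb


def repetition_error_probability(n: int, p: float) -> float:
    """Error probability of a majority vote over n uses of a binary
    symmetric channel with crossover probability p (n odd, so no ties)."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that the vote cannot tie")
    # The vote fails exactly when more than half of the n copies are reversed.
    return sum(
        comb(n, j) * p**j * (1 - p) ** (n - j)
        for j in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    p = 0.1  # illustrative crossover probability
    for n in (1, 3, 11, 51):
        err = repetition_error_probability(n, p)
        print(f"n={n:3d}  rate={1 / n:.4f}  error={err:.3e}")
```

The error probability does fall as n grows, but only because the rate falls with it; this is the sense in which repetition differs from redundancy spread over entire words.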

In  the  basic  and  simplest  description  of  sequential  decoding 
it  is  assumed  that  the  channel  is  binary  symmetric  and  that  one 
has the problem of reproducing a stream (doubly-infinite sequence)
of  chance  "information"  digits  which  take  the  values  0  and  1  with 
equal  probability,  all  independently  of  each  other.  (A  binary 
symmetric  channel  reproduces  each  of  the  two  digits  correctly 
with  probability  1-p,  say,  and  reverses  the  digits  with  probability 
p.) Suppose that the digits actually realized are
$\ldots, m_{-1}, m_0, m_1, \ldots$. Let m and k be integers, and suppose
that the rate R = 1/m and that mk is the "constraint length". Each
information digit will be coded into m digits which are then
transmitted over the channel. Each of these m digits is a linear
combination of the information digit being sent, say $m_0$, and its
(k-1) immediate predecessors, $m_{-1}, \ldots, m_{-(k-1)}$. Hence the "effect"
of  any  information  digit  extends  over  mk  digits,  transmitted 
and  received,  and  one  does  not  decode  this  information  digit 
until  mk  digits  have  been  received;  this  delay  is  a  price  paid 
for using the method. The decoder is now supposed to know the
preceding (k-1) information digits. (He has decoded them
correctly with very high probability.) He begins a search
which  will  end  in  decoding  the  current  information  digit.  This 
search  is  impossible  to  describe  under  our  present  limitations. 


It is based on the fact that, with very high probability
(depending upon mk), the "distance" (this depends on the particular
sequential decoding procedure) between the received sequence of
mk digits and a transmitted sequence of mk digits which corresponds
to any information sequence $\bar{m}_0, m_1, \ldots$, where $\bar{m}_0$ is the digit
different from $m_0$, is large. By good sequential decoding
procedures one can relatively quickly eliminate as possibilities
all sequences which start with $\bar{m}_0$, with small probability of
error. Here "relatively quickly" refers to the average number
of required searches and computations as compared to the
probability of error. The above description is very, very crude
and incomplete, as any description of this brevity must be, and
at best can only give an inkling of the flavor of this beautiful
idea.
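
The encoding step described above, and the distance property on which the search rests, can be sketched as follows. This is only an illustration under stated assumptions: the tap coefficients are drawn at random, digits before the start of the stream are taken to be 0 (standing in for predecessors already known to the decoder), the "distance" is taken to be Hamming distance, and the names `encode`, `bsc`, `hamming` and `compare_hypotheses` are hypothetical. The comparison enumerates every sequence that begins with the wrong digit, which is exactly the exponential work that a real sequential decoder is designed to avoid; it is done here only to exhibit the gap in distance.

```python
import itertools
import random


def encode(info, taps):
    """Rate-1/m binary convolutional encoder.

    info : list of 0/1 information digits.
    taps : m tuples of k coefficients (0/1); taps[j][i] multiplies the
           information digit sent i steps earlier (mod 2 arithmetic).
    Returns the m * len(info) channel digits.
    """
    k = len(taps[0])
    out = []
    for t in range(len(info)):
        # Current digit followed by its k-1 immediate predecessors.
        window = [info[t - i] if t - i >= 0 else 0 for i in range(k)]
        for g in taps:
            out.append(sum(a * b for a, b in zip(g, window)) % 2)
    return out


def bsc(digits, p, rng):
    """Reverse each digit independently with probability p."""
    return [d ^ int(rng.random() < p) for d in digits]


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def compare_hypotheses(taps, p, rng):
    """Send k random digits and return (distance from the received digits to
    the true code sequence, smallest distance to any code sequence whose
    first information digit is wrong)."""
    k = len(taps[0])
    info = [rng.randint(0, 1) for _ in range(k)]
    received = bsc(encode(info, taps), p, rng)
    d_true = hamming(received, encode(info, taps))
    wrong_first = 1 - info[0]
    d_wrong = min(
        hamming(received, encode([wrong_first, *tail], taps))
        for tail in itertools.product((0, 1), repeat=k - 1)
    )
    return d_true, d_wrong


if __name__ == "__main__":
    rng = random.Random(0)
    m, k, p = 3, 8, 0.03  # illustrative values; the constraint length is m * k
    taps = [tuple([1] + [rng.randint(0, 1) for _ in range(k - 1)]) for _ in range(m)]
    for _ in range(10):
        d_true, d_wrong = compare_hypotheses(taps, p, rng)
        print(f"distance to true sequence: {d_true:2d}   best wrong-first-digit sequence: {d_wrong:2d}")
```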


The published results in the literature of sequential
decoding consist of descriptions of different schemes, and the
theorems are statements about the expected number of computations
needed by the scheme under certain conditions and the probability
of error in decoding. These results are often clever and ingenious
and considerable difficulties have to be surmounted in obtaining
them. Unfortunately for the mathematicians, the theorems are
almost never elegant, as are the theorems in the Shannon theory.
There are no clear-cut proofs that a certain procedure is optimal.
Perhaps these will still come.

We close this section with a brief description of a totally
different and also very clever idea. Consider a channel with
feedback where independent, identically distributed Gaussian errors
are added to each transmitted signal, and 1/n times the sum of the
squares of the n signals which represent a word is bounded above
by a given constant (the "average power"). According to this
idea the message is coded in an arbitrary but fixed manner into
one of a set of equally spaced points of an interval, and this
point (number) is transmitted; this is the first of the sequence
of n signals which will be used to transmit the message. The
i-th transmitted signal, i = 2, ..., n, is a suitably chosen linear
function of the message and all previously received signals. The
decoding of the message after the n-th received signal is also
very simple: one decodes the message sent as the one corresponding
to that one of the equally spaced points which is nearest to the
n-th received signal. It has been proved that this method is
optimal in a very natural and reasonable sense, and it is clear
from the description that it involves a minimum of encoding
and decoding computations. Unfortunately, it possesses one very
serious drawback. If n is sufficiently large the probability
is very close to one that at least one of the signals will require
for its transmission an amount of power (i.e., the square of the
signal) which exceeds any fixed bound (and hence the "capacity"
of the instrument).


Some  references  for  this  section: 

4. Coding with a fidelity criterion. Work of the Russian School.

Shannon [...] and other writers have studied the following
problem: An (infinite) sequence of information digits, i.e.,
values taken by a sequence of chance variables with a known
distribution (the chance variables are usually independently
and identically distributed, but this is not essential) is
produced by a source. After the source has produced n digits
the latter are coded into a sequence of n' digits; this sequence
is transmitted over a noisy channel, and the received n'-sequence
is decoded into a sequence of n information digits. (The actual
formulation is slightly more general.)

There is a "fidelity criterion" which measures the "distortion"
between the sequence of n digits thus decoded and the sequence
of n digits produced by the source. The results obtained in
the theory deal with such questions as the minimum ratio n'/n
needed so that the distortion not exceed a given bound, and the
geometric problem of the minimum number of n-sequences of
information digits needed to "span" the space of such sequences of
information digits to within a specified bound on the distortion.
We shall not describe these results. Instead, we shall describe
the generalization of the above model whose study has been a
major occupation of the Russian school of information theorists,
and shall describe a typical and important result of the Russian
school about this model.
Let $(\xi^1, \eta^1), (\xi^2, \eta^2), \ldots, (\xi^t, \eta^t), \ldots$ be a sequence of
pairs of chance variables, the pair $(\xi^t, \eta^t)$ being defined on the
probability space $(\Omega^t, S^t, P^t)$ with values in the measure spaces
$(X^t, S_X^t)$ and $(Y^t, S_Y^t)$, respectively. The information of this
pair $(\xi^t, \eta^t)$ relative to each other is defined by

$$I(\xi^t, \eta^t) = \int_{X^t \times Y^t} \log a^t \, dp^t_{\xi\eta},$$

where $p^t_{\xi\eta}$ is the joint probability distribution of $(\xi^t, \eta^t)$
determined by $P^t$, and $a^t$ is the density of $p^t_{\xi\eta}$ with respect to
$p^t_\xi \times p^t_\eta$, the product of the marginal measures. It is always
supposed that the integral in I is finite, so that $a^t$ is
finite with probability one. One denotes $\log a^t$ by $i^t(\xi^t, \eta^t)$ and
calls it the information density. The sequence $(\xi^t, \eta^t)$ is
called information stable if, for all t sufficiently large,
$I(\xi^t, \eta^t) > 0$ and

$$\frac{i^t(\xi^t, \eta^t)}{I(\xi^t, \eta^t)}$$

converges stochastically to one.
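
As a hedged illustration of information stability, consider the simplest case, assumed here and not taken from the report: the first member of the t-th pair is a block of t independent equiprobable binary digits and the second is the same block after passage through a binary symmetric channel with crossover probability p (0 < p < 1). The information density of the block is then a sum of t independent terms, and the information is t times the information of a single digit, so the ratio in the definition should settle near one as t grows. The function names are hypothetical.

```python
import math
import random


def binary_entropy(p: float) -> float:
    """Entropy in bits of a digit that equals 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def information_density_ratio(t: int, p: float, rng: random.Random) -> float:
    """Sample (information density) / (information) for a block of t digits.

    With equiprobable inputs each output digit is also equiprobable, so one
    use of the channel contributes log2(P(y|x) / P(y)), which is
    log2(2(1-p)) for a digit received correctly and log2(2p) for a reversed one.
    """
    density = 0.0
    for _ in range(t):
        if rng.random() < p:
            density += math.log2(2 * p)
        else:
            density += math.log2(2 * (1 - p))
    information = t * (1 - binary_entropy(p))
    return density / information


if __name__ == "__main__":
    rng = random.Random(0)
    for t in (10, 100, 1_000, 10_000, 100_000):
        print(f"t={t:6d}  ratio={information_density_ratio(t, 0.1, rng):.4f}")
```

For short blocks the ratio fluctuates widely; for long blocks it concentrates near one, which is the behaviour the definition of information stability asks for.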

Let $(X, S_X)$ and $(\tilde{X}, S_{\tilde{X}})$ be two measure spaces, respectively the
space of input messages and output messages. Let W be a given
set of distributions $p_{\xi\tilde{\xi}}$ of chance variables $\xi$, $\tilde{\xi}$ with values
in $X \times \tilde{X}$, such that the marginal distributions of $\xi$ are all the
same. This set is called the message $\{W\}$, and $p_\xi$ is
called the distribution of the input message. Any pair $(\xi, \tilde{\xi})$
of chance variables whose joint distribution belongs to W is
said to satisfy the conditions of reproduction W. Without loss
of much generality one limits one's self to W defined as follows:
Suppose given real functions $\rho_i(x, \tilde{x})$, i = 1, ..., M, defined
on $X \times \tilde{X}$ and measurable with respect to the $\sigma$-algebra $S_X \times S_{\tilde{X}}$,
and an M-dimensional set $\bar{W}$. W consists of all distributions
$p_{\xi\tilde{\xi}}$ (with the same marginal distribution of $\xi$) for which the
vector whose i-th component, i = 1, ..., M, is

$$E \rho_i(\xi, \tilde{\xi})$$

belongs to $\bar{W}$. The "message entropy with accuracy of reproduction
W" is defined to be

$$H(W) = \inf I(\xi, \tilde{\xi}),$$

the infimum being taken over all pairs $(\xi, \tilde{\xi})$ which satisfy the
conditions of reproduction W.

The sequence of messages $\{W^t\}$ is called information stable if
there exists an information stable sequence of pairs $(\xi^t, \tilde{\xi}^t)$
such that the t-th pair satisfies the condition of reproduction
$W^t$.

The problem of obtaining general sufficient conditions for the
information stability of a sequence of messages is bound up
with the problem of obtaining sufficiently general conditions
for the information stability of a sequence of pairs of chance
variables.

Let $(Y, S_Y)$ and $(\tilde{Y}, S_{\tilde{Y}})$ be two measure spaces which serve
as the space of input signals and space of output signals,
respectively. Let $Q(y, \tilde{A})$, $y \in Y$, $\tilde{A} \in S_{\tilde{Y}}$, be a transition
function such that a) for fixed y, $Q(y, \cdot)$ is a probability
measure on $S_{\tilde{Y}}$; b) for fixed $\tilde{A} \in S_{\tilde{Y}}$, $Q(\cdot, \tilde{A})$ is measurable
with respect to the $\sigma$-algebra $S_Y$. Let V be a given set of
probability distributions on $(Y \times \tilde{Y}, S_Y \times S_{\tilde{Y}})$. The system
consisting of $Y$, $\tilde{Y}$, Q, and V is called "the transmitter" and
will be denoted for brevity by $\{Q, V\}$. The chance variables
$\eta$, $\tilde{\eta}$, with values in $Y$ and $\tilde{Y}$, respectively, are "connected by
the transmitter $\{Q, V\}$" if their joint distribution belongs to
V and for any $\tilde{A} \in S_{\tilde{Y}}$ the conditional probability

$$P\{\tilde{\eta} \in \tilde{A} \mid \eta\} = Q(\eta, \tilde{A})$$

with probability one. Again, without loss of much generality,
one limits one's self to sets V defined as follows:
Suppose given real functions $\pi_i(y, \tilde{y})$, i = 1, ..., N, defined on
$Y \times \tilde{Y}$, and an N-dimensional set $\bar{V}$. The set V of distributions
consists of all distributions of the pair $(\eta, \tilde{\eta})$ such that the
vector whose i-th component, i = 1, ..., N, is

$$E \pi_i(\eta, \tilde{\eta})$$

is in $\bar{V}$. When the $\pi_i$ depend only on y the constraint imposed
by V is on the input signal only. The capacity of the
transmitter $\{Q, V\}$ is defined to be

$$C(Q, V) = \sup I(\eta, \tilde{\eta}),$$

the supremum being taken over all pairs $(\eta, \tilde{\eta})$ connected by the
transmitter $\{Q, V\}$.
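
The definition of capacity as a supremum of the information over admissible input distributions can be illustrated on the smallest possible example. The sketch below is an assumption-laden illustration and not part of the report: the transmitter is a binary symmetric channel with crossover probability p, the set V imposes no constraint, and the supremum is approximated by a plain grid search over the probability q that the input signal equals 1. The function names are hypothetical.

```python
import math


def mutual_information(q: float, p: float) -> float:
    """Information in bits between input and output of a binary symmetric
    channel with crossover probability p when P(input = 1) = q."""
    joint = {
        (0, 0): (1 - q) * (1 - p),
        (0, 1): (1 - q) * p,
        (1, 0): q * p,
        (1, 1): q * (1 - p),
    }
    p_in = {0: 1 - q, 1: q}
    p_out = {0: joint[(0, 0)] + joint[(1, 0)], 1: joint[(0, 1)] + joint[(1, 1)]}
    # Integral of the log-density of the joint law with respect to the
    # product of the marginals; terms of zero probability contribute nothing.
    return sum(
        v * math.log2(v / (p_in[x] * p_out[y]))
        for (x, y), v in joint.items()
        if v > 0
    )


def capacity_by_grid(p: float, steps: int = 1000) -> float:
    """Approximate the supremum of the information over input distributions."""
    return max(mutual_information(i / steps, p) for i in range(steps + 1))


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.25, 0.5):
        print(f"p={p:.2f}  capacity ~ {capacity_by_grid(p):.6f} bits per digit")
```

For this channel the maximum is attained at q = 1/2, so the grid search agrees with the familiar closed form 1 - h(p), where h is the binary entropy function.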


The sequence of transmitters $\{Q^t, V^t\}$ is said to be information
stable if there exists an information stable sequence of pairs
of chance variables $(\eta^t, \tilde{\eta}^t)$ such that the t-th pair is connected
by the t-th transmitter and such that

$$\lim_{t \to \infty} \frac{I(\eta^t, \tilde{\eta}^t)}{C(Q^t, V^t)} = 1.$$

All published results concern themselves only with information
stable sequences of transmitters.

The message $\{W\}$ is said to be transmissible by means of the
transmitter $\{Q, V\}$ if there exists a sequence of four chance
variables $(\xi, \eta, \tilde{\eta}, \tilde{\xi})$ such that: a) this sequence is
a Markov chain; b) the pair $(\xi, \tilde{\xi})$ satisfies the conditions of
reproduction W; c) the pair $(\eta, \tilde{\eta})$ is connected by the
transmitter $\{Q, V\}$. The intuitive meaning of the above is as follows:
The input is the chance variable $\xi$ with given
distribution $p_\xi$. The input message $\xi$ is coded into the input
signal $\eta$, which is sent over the transmitter (channel) and
received as $\tilde{\eta}$. Then $\tilde{\eta}$ is decoded into the output message $\tilde{\xi}$.
The transmitter is given and W represents the desired accuracy
of reproduction. The conditional probability of $\eta$, given $\xi$,
is the randomized encoding procedure. The conditional distribution
of $\tilde{\eta}$, given $\eta$ (i.e., $Q(\eta, \cdot)$), is the distribution of the
received signal. The conditional probability of $\tilde{\xi}$, given $\tilde{\eta}$,
is the randomized decoding procedure.

*  » 

It is easy to prove that a necessary condition that the
message $\{W\}$ be transmissible by means of the transmitter $\{Q, V\}$
is that

$$H(W) \le C(Q, V).$$
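
One way to see why this condition is necessary is sketched below. The argument is supplied here as an aid and is not quoted from the report; it writes $\xi, \eta, \tilde{\eta}, \tilde{\xi}$ for the input message, input signal, output signal and output message of the definition of transmissibility, and it relies on the standard fact that information cannot increase along a Markov chain.

```latex
\begin{align*}
H(W) &\le I(\xi, \tilde{\xi})
  && \text{since } (\xi, \tilde{\xi}) \text{ satisfies the conditions of reproduction } W, \\
I(\xi, \tilde{\xi}) &\le I(\eta, \tilde{\eta})
  && \text{since } \xi \to \eta \to \tilde{\eta} \to \tilde{\xi} \text{ is a Markov chain}, \\
I(\eta, \tilde{\eta}) &\le C(Q, V)
  && \text{since } (\eta, \tilde{\eta}) \text{ is connected by the transmitter } \{Q, V\}.
\end{align*}
```

Chaining the three inequalities gives $H(W) \le C(Q, V)$.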


A principal concern of the writers of the Russian school is to
prove that, asymptotically and under additional reasonable
regularity conditions, this condition is also sufficient. We
shall now describe a typical and important result. Let $\{W^t\}$
be a given sequence of messages and $\{Q^t, V^t\}$ a given sequence
of transmitters. We define the distance r(a,b) between two
points of the same Euclidean space as the maximum absolute
deviation between corresponding components of a and b. If U
is a set in a Euclidean space let $[U]_\varepsilon$ denote the set of all
points within an r-distance of at most $\varepsilon$ from some point of U.
We now replace $\bar{W}$ by $[\bar{W}]_\varepsilon$, and call the corresponding message
$\{W_\varepsilon\}$; also replace $\bar{V}$ by $[\bar{V}]_\varepsilon$ and call the corresponding
transmitter $\{Q, V_\varepsilon\}$. We say that the message $\{W\}$ is transmissible
by means of the transmitter $\{Q, V\}$ "within an event of probability $\varepsilon$"
if there exist four chance variables $\xi, \eta, \tilde{\eta}, \tilde{\xi}$ and a fifth
chance variable $\tilde{\eta}'$, defined on the same space as $\tilde{\eta}$, such that
a) $(\xi, \eta, \tilde{\eta}, \tilde{\xi})$ form a Markov chain; b) $(\xi, \tilde{\xi})$ satisfy
the conditions of reproduction W; c) the pair $(\eta, \tilde{\eta}')$ is
connected by the transmitter $\{Q, V\}$; d) the probability that $\tilde{\eta} \ne \tilde{\eta}'$
is not greater than $\varepsilon$. Now let $\{W^t\}$ be a given sequence of

messages and $\{Q^t, V^t\}$ a given sequence of transmitters, such that

a) $\lim_{t \to \infty} H(W^t) = \infty$;

b) $\liminf_{t \to \infty} \dfrac{C(Q^t, V^t)}{H(W^t)} > 1$;

c) the number $M^t$ of functions $\rho_i^t$ in the definition of the
message and the number $N^t$ of functions $\pi_i^t$ in the definition of
the transmitter are such that, for every a > 0,

$$M^t = o(\exp_2\{a H(W^t)\}),$$

$$N^t = o(\exp_2\{a C(Q^t, V^t)\});$$

d) the sequence of transmitters $\{Q^t, V^t\}$ is information stable;

e) for some $\{(\eta^t, \tilde{\eta}^t)\}$, a sequence of pairs with respect to
which the sequence $\{Q^t, V^t\}$ is information stable, for some
b > 0 and for every a > 0,

$$E\,|\pi_i^t(\eta^t, \tilde{\eta}^t)|^{1+b} = o(\exp_2\{a C(Q^t, V^t)\});$$

f) the sequence $\{W^t\}$ is information stable, and

g) for some sequence $(\xi^t, \tilde{\xi}^t)$ of information stable pairs of chance
variables with respect to which $\{W^t\}$ is information stable
and which also satisfy the conditions of reproduction $W^t$,
for some b > 0 and every a > 0,

$$E\,|\rho_i^t(\xi^t, \tilde{\xi}^t)|^{1+b} = o(\exp_2\{a H(W^t)\}).$$

Then, for every $\varepsilon > 0$ there exists a number T such that for $t \ge T$
the message $\{W^t_\varepsilon\}$ is transmissible by the transmitter $\{Q^t, V^t_\varepsilon\}$
"within an event of probability $\varepsilon$." Under additional conditions
one can eliminate the phrase in quotation marks. One such set
of conditions is that each of the sequences $\{M^t\}$, $\{N^t\}$, and

$$\sup_k \sup_{x, \tilde{x}} |\rho_k^t(x, \tilde{x})|, \qquad \sup_k \sup_{y, \tilde{y}} |\pi_k^t(y, \tilde{y})|$$

should be bounded.